Add ContextBench harness core #120
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cad646d9d9
```ts
};
const rawTrace = {
  executor,
  model: executor === 'claude' ? model : 'fake-executor',
```
Preserve actual executor model in raw trace
Set rawTrace.model to the selected model for all real executors, not just Claude. As written, non-Claude runs (codex, gemini, opencode) are recorded as "fake-executor" while the manifest row stores taskExecution.model from --model, so Phase 42 provenance checks (rawTrace.model === row.taskExecution.model) will fail even when the run is otherwise valid, blocking claim-grade verification for those lanes.
Fixed in 867ac70: raw traces now record `model: executor === 'fake' ? 'fake-executor' : model`, so Codex/Gemini/OpenCode preserve the selected model. The adapter smoke test now asserts `rawTrace.model === row.taskExecution.model` and executor consistency for all three adapters.
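The described fix amounts to a one-line model selector; a minimal sketch (the `Executor` union and the helper name are illustrative, not the PR's actual code):

```typescript
// Hypothetical sketch of the provenance fix discussed above.
type Executor = 'claude' | 'codex' | 'gemini' | 'opencode' | 'fake';

// Only the fake executor records a placeholder; every real executor
// preserves the model selected via --model, so the Phase 42 check
// rawTrace.model === row.taskExecution.model can hold for all lanes.
function recordedModel(executor: Executor, model: string): string {
  return executor === 'fake' ? 'fake-executor' : model;
}
```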
```ts
return {
  laneId: laneCard.laneId,
  proven,
  sourceKind: telemetry?.proofSource ? 'env_override' : 'not_captured',
```
Emit claim-eligible lane isolation source kind
When telemetry evidence is present, this always emits sourceKind: 'env_override', but the Phase 42 gate explicitly rejects env_override as insufficient lane-isolation proof. That means rows with otherwise good observed-tools evidence can never satisfy lane isolation in claim verification, because the producer and verifier disagree on the accepted source kind.
Already addressed in the current branch: lane telemetry preserves sourceKind from the evidence source, including proxy and transcript, instead of collapsing everything to env_override. The Phase 42 gate still rejects env_override as diagnostic-only, so claim-grade rows require explicit proxy/transcript evidence.
Clarifying previous reply: this was already addressed in the branch. Lane telemetry now preserves the evidence source kind from the telemetry payload, including proxy and transcript, instead of treating every telemetry-backed row as env_override. The verifier still rejects env_override for claim-grade lane proof.
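A minimal sketch of that pass-through behaviour (the type and field names are assumed from the review discussion, not taken from the branch):

```typescript
// Hypothetical sketch: preserve the telemetry payload's own source kind
// instead of collapsing every telemetry-backed row to 'env_override'.
type LaneSourceKind = 'proxy' | 'transcript' | 'env_override' | 'not_captured';

interface LaneTelemetry {
  proofSource?: LaneSourceKind;
}

function laneSourceKind(telemetry: LaneTelemetry | undefined): LaneSourceKind {
  // No telemetry evidence at all: nothing was captured for this lane.
  if (!telemetry?.proofSource) return 'not_captured';
  // Pass the evidence source through unchanged; the Phase 42 gate can then
  // accept proxy/transcript while still rejecting env_override as diagnostic-only.
  return telemetry.proofSource;
}
```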
Greptile Summary

This PR adds the non-claim-bearing ContextBench harness: runner, retrieval gate, structured answer parser, scoring module, artifact utilities, evidence gate, trajectory normaliser, and a comprehensive test suite. The two previously-flagged regressions (scoring field mismatches and hardcoded setup/index durations) have been addressed.
Confidence Score: 4/5

Safe to merge as non-claim-bearing infrastructure; the P1 scorer/gate artifact gap must be resolved before claim-bearing runs are attempted. One P1 defect: the TypeScript scorer writes stdout/stderr as inline text but the evidence gate requires stdoutPath/stderrPath file paths, so any real scorer artifact will permanently fail hasOfficialEvaluatorProof. Since claimAllowed is false throughout this PR, the gate is never exercised end-to-end yet, keeping the PR safe to land as infrastructure — but the gap must be closed before the claim path is activated. src/eval/contextbench-scoring.ts: ContextBenchScoreResult must add stdoutPath/stderrPath, and the function must write stdout/stderr to separate log files.
|
| Filename | Overview |
|---|---|
| src/eval/contextbench-scoring.ts | Scorer emits inline stdout/stderr text but evidence gate requires stdoutPath/stderrPath file paths — gate will always emit official_evaluator_missing for any artifact produced by this module. |
| src/eval/contextbench-evidence-gate.ts | Evidence gate logic is thorough and well-structured; all gate checks (official evaluator, lane isolation, setup/index cost, runner provenance, denominator contract) are coherent and correctly gated by evidenceMode. |
| src/eval/contextbench-artifacts.ts | buildManifestRow now accepts caller-provided setupIndex; scoring fields are deliberately hardcoded to non-claim-bearing values for Phase 38 smoke runs, consistent with test assertions. |
| src/eval/contextbench-trajectory.ts | Trajectory normalisation is correct; pred_steps[0].spans and pred_spans share the same object reference, which could be problematic if consumers mutate the trajectory output. |
| tests/contextbench-phase42-evidence-gate.test.ts | Comprehensive gate test coverage; passingArtifacts() constructs stdoutPath/stderrPath manually, masking the gap between the TypeScript scorer's output and the gate's requirements. |
| tests/contextbench-scoring.test.ts | Tests cover scorer return value fields and fallback metadata well, but do not verify that the written score JSON artifact satisfies the evidence gate's stdoutPath/stderrPath requirements. |
| tests/contextbench-runner-contract.test.ts | Runner contract tests cover fixture validation, fake-executor smoke runs, manifest append semantics, and setupIndex propagation cleanly. |
Sequence Diagram
```mermaid
sequenceDiagram
    participant Runner as contextbench-runner.mjs
    participant Scorer as scoreWithOfficialEvaluatorFirst (TS)
    participant Disk as Score Artifact (score.json)
    participant Gate as evaluateContextBenchEvidenceGate
    Runner->>Scorer: run official evaluator
    Scorer->>Disk: writeJson(outputPath, { stdout, stderr, exitCode, ... })
    Note over Disk: stdoutPath/stderrPath absent
    Runner->>Gate: artifactsByRunId[runId].score = parse(score.json)
    Gate->>Gate: hasOfficialEvaluatorProof(row, score, hashes)
    Note over Gate: checks score.stdoutPath → undefined → returns false
    Gate-->>Runner: official_evaluator_missing failure
```
Reviews (2): Last reviewed commit: "fix(test): harden ContextBench schema cl..."
```ts
  missingEvidenceFiles: string[];
  unsupportedClaim: boolean;
  falseReady: boolean;
  reasons: string[];
}

function writeJson(filePath: string, value: unknown): void {
  mkdirSync(path.dirname(filePath), { recursive: true });
  writeFileSync(filePath, `${JSON.stringify(value, null, 2)}\n`, 'utf8');
}

export async function scoreWithOfficialEvaluatorFirst(
  params: OfficialEvaluatorParams
): Promise<ContextBenchScoreResult> {
  const args = [
    '-m',
    'contextbench.evaluate',
    '--gold',
    params.goldPath,
    '--pred',
    params.predictionPath
  ];
  if (params.cachePath) args.push('--cache', params.cachePath);
  args.push('--out', params.outputPath);
  const command = `python ${args.join(' ')}`;
  const result = await params.runner('python', args, params.cwd);
  if (result.status === 0) {
    const score = {
      status: 'completed' as const,
      mode: 'official_evaluator' as const,
      claimBearing: true,
```
ContextBenchScoreResult is incompatible with ContextBenchScoreEvidence
scoreWithOfficialEvaluatorFirst returns (and writes) a score object with exitStatus, but ContextBenchScoreEvidence (consumed by hasOfficialEvaluatorProof in the evidence gate) expects exitCode. Additionally, officialEvaluatorInvoked is absent from ContextBenchScoreResult. Because of these two mismatches, any TypeScript harness that stores this function's return value as the score artifact will cause hasOfficialEvaluatorProof to always return false — permanently blocking the claim gate even for a valid run.
The runner .mjs correctly emits both exitCode and officialEvaluatorInvoked: true inline (lines ~1091–1120), but the TypeScript module diverges silently. The two representations need to be reconciled.
Addressed in the current branch before this latest push: ContextBench scoring now emits the gate-compatible evaluator fields, including exitCode, officialEvaluatorFirst, officialEvaluatorAttempted, officialEvaluatorInvoked, command, outputPath, stdoutPath, and stderrPath. The scorer tests cover claimAllowed false versus true behavior and the metadata contract.
```ts
  return content.split('\n').map((line) => JSON.parse(line) as ContextBenchRunManifestRow);
}

export function buildManifestRow(params: {
  runId: string;
  protocolVersion: string;
  protocolHash: string;
  taskManifestHash: string;
  laneCard: ContextBenchLaneToolCard;
  task: ContextBenchTaskIdentity;
  repeatIndex: number;
  status: ContextBenchTerminalStatus;
  startedAt: string;
  completedAt: string;
  paths: ArtifactPathSet;
  hashes: Record<string, string>;
  executor: ContextBenchExecutor;
  model: string;
  timeoutSeconds: number;
  maxContextTokens: number;
  maxAnswerTokens: number;
}): ContextBenchRunManifestRow {
  return {
```
buildManifestRow hardcodes setupDurationMs/indexDurationMs to 0
hasMeasuredSetupIndex in the evidence gate checks if (evidence.setupStatus === 'completed' && setupDuration <= 0) return false, so any row emitted by this function whose setupStatus is 'completed' will permanently fail the setup_index_cost_missing gate. Additionally, the cross-check row.setupIndex.setupDurationMs !== evidence.setupDurationMs compares the row's hardcoded 0 against the artifact's actual measured value, causing a guaranteed mismatch.
The runner script uses a separate buildManifestRowForArtifacts that receives actual measurements; this TypeScript helper should similarly require measured values rather than silently defaulting to 0.
Addressed in the current branch: buildManifestRow now requires caller-provided setupIndex evidence instead of hardcoding setup/index durations to zero, and the runner contract test asserts setupIndex propagation into manifest rows.
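The gate-side rule this reply refers to can be sketched as follows (the check is quoted from the review text above; the surrounding type is an assumption for illustration):

```typescript
// Hypothetical sketch of the measured setup/index requirement described above.
interface SetupIndexEvidence {
  setupStatus: 'completed' | 'skipped' | 'failed';
  setupDurationMs: number;
  indexDurationMs: number;
}

// Mirrors the described hasMeasuredSetupIndex rule: a setup reported as
// 'completed' with a non-positive duration cannot count as measured, so a
// hardcoded 0 would always fail the setup_index_cost_missing gate.
function hasMeasuredSetupIndex(evidence: SetupIndexEvidence): boolean {
  if (evidence.setupStatus === 'completed' && evidence.setupDurationMs <= 0) {
    return false;
  }
  return true;
}
```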
```ts
  isError?: boolean;
};

const SLOW_WINDOWS_TEST_TIMEOUT_MS = 60000;
```
Acknowledged, but not changing for this PR. The duplicate timeout constant is test-local and non-blocking; extracting a shared helper would add a new file for two call sites while the active merge blockers here are ContextBench evidence/provenance contracts.
…s' into pr/contextbench-harness-core
```ts
export interface ContextBenchScoreResult {
  status: 'completed' | 'judge_failed';
  mode: 'official_evaluator' | 'diagnostic_fallback';
  claimBearing: boolean;
  command: string;
  stdout: string;
  stderr: string;
  exitStatus: number | null;
  exitCode: number | null;
  officialEvaluatorFirst: boolean;
  officialEvaluatorAttempted: boolean;
  officialEvaluatorInvoked: boolean;
  outputPath: string;
  fallbackReason?: string;
}

export interface FactRecallDiagnosticResult {
  missingRequiredFacts: string[];
  missingEvidenceFiles: string[];
  unsupportedClaim: boolean;
  falseReady: boolean;
  reasons: string[];
}

function writeJson(filePath: string, value: unknown): void {
  mkdirSync(path.dirname(filePath), { recursive: true });
  writeFileSync(filePath, `${JSON.stringify(value, null, 2)}\n`, 'utf8');
}

export async function scoreWithOfficialEvaluatorFirst(
  params: OfficialEvaluatorParams
): Promise<ContextBenchScoreResult> {
  const args = [
    '-m',
    'contextbench.evaluate',
    '--gold',
    params.goldPath,
    '--pred',
    params.predictionPath
  ];
  if (params.cachePath) args.push('--cache', params.cachePath);
  args.push('--out', params.outputPath);
  const command = `python ${args.join(' ')}`;
  const result = await params.runner('python', args, params.cwd);
  if (result.status === 0) {
    const score = {
      status: 'completed' as const,
      mode: 'official_evaluator' as const,
      claimBearing: params.claimAllowed === true,
      command,
      stdout: result.stdout,
      stderr: result.stderr,
      exitStatus: result.status,
      exitCode: result.status,
      officialEvaluatorFirst: true,
      officialEvaluatorAttempted: true,
      officialEvaluatorInvoked: true,
      outputPath: params.outputPath
    };
    writeJson(params.outputPath, score);
    return score;
  }

  const score = {
    status: 'judge_failed' as const,
    mode: 'diagnostic_fallback' as const,
    claimBearing: false,
    command,
    stdout: result.stdout,
    stderr: result.stderr,
    exitStatus: result.status,
    exitCode: result.status,
    officialEvaluatorFirst: true,
    officialEvaluatorAttempted: true,
    officialEvaluatorInvoked: true,
    outputPath: params.outputPath,
    fallbackReason: 'official_evaluator_failed'
  };
  writeJson(params.outputPath, score);
  return score;
```
Scorer artifact missing stdoutPath/stderrPath; evidence gate will always reject it
scoreWithOfficialEvaluatorFirst writes stdout and stderr as inline raw text fields in the score JSON artifact. But hasOfficialEvaluatorProof in the evidence gate unconditionally checks all three of these conditions:
```ts
typeof score.stdoutPath === 'string' && score.stdoutPath.length > 0 &&
hasSha256Hash(artifactHashesByPath[score.stdoutPath]) &&
typeof score.stderrPath === 'string' && score.stderrPath.length > 0 &&
hasSha256Hash(artifactHashesByPath[score.stderrPath])
```

Because ContextBenchScoreResult has no stdoutPath/stderrPath fields, the serialised score artifact will always have stdoutPath === undefined, causing hasOfficialEvaluatorProof to return false and permanently emitting an official_evaluator_missing failure, even for a successful, claim-allowed run.
The evidence gate test constructs stdoutPath/stderrPath by hand in passingArtifacts(), so this gap is not caught by the existing scorer tests. The scorer must write stdout/stderr to separate log files and include their paths in the score artifact for the gate contract to close.
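One way to close the gap, sketched under the assumption that the log files live next to the score artifact (the file names and helper are illustrative, not the PR's actual code):

```typescript
import { mkdirSync, writeFileSync } from 'node:fs';
import * as path from 'node:path';

// Hypothetical sketch: persist stdout/stderr to sibling log files and
// return their paths so the score artifact can carry the stdoutPath/
// stderrPath fields the evidence gate checks. The runner would also need
// to hash these files into artifactHashesByPath for hasSha256Hash to pass.
function writeScoreLogs(
  outputPath: string,
  stdout: string,
  stderr: string
): { stdoutPath: string; stderrPath: string } {
  const dir = path.dirname(outputPath);
  mkdirSync(dir, { recursive: true });
  const stdoutPath = path.join(dir, 'score.stdout.log');
  const stderrPath = path.join(dir, 'score.stderr.log');
  writeFileSync(stdoutPath, stdout, 'utf8');
  writeFileSync(stderrPath, stderr, 'utf8');
  return { stdoutPath, stderrPath };
}
```

The scorer would then spread the returned paths into both `score` objects before calling `writeJson`, closing the contract with the gate.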
Summary

Verification

- rtk node scripts/contextbench-runner.mjs --validate-fixtures
- rtk node scripts/contextbench-runner.mjs --validate-lane-setup
- rtk pnpm exec vitest run tests/contextbench-runner-contract.test.ts tests/contextbench-lane-setup.test.ts tests/contextbench-scoring.test.ts tests/contextbench-trajectory.test.ts tests/contextbench-baseline-schema-gate.test.ts tests/contextbench-baseline-snapshot.test.ts tests/contextbench-baseline-runner.test.ts tests/contextbench-phase42-evidence-gate.test.ts tests/contextbench-protocol.test.ts tests/contextbench-task-manifest.test.ts
- rtk pnpm run format:check
- rtk pnpm exec tsc --noEmit
- rtk pnpm run build
- rtk git push -u origin pr/contextbench-harness-core

Claim Posture
claimAllowed, or claim Phase 42/product improvement success.